Imbalanced class distribution and performance evaluation metrics: A systematic review of prediction accuracy for determining model performance in healthcare systems

Predictive algorithms and their performance evaluation are covered extensively in most research studies that seek the best or most appropriate predictive model, with the optimum prediction solution indicated by prediction accuracy score, precision, recall, f1-score etc. Prediction accuracy score has been used extensively as the main determining metric for performance recommendation. It is one of the most widely used metrics for identifying the optimal prediction solution, irrespective of the context or nature of the dataset and of the output class distribution between the minority and majority classes. The key research question, however, is the impact of class inequality on prediction accuracy score in datasets with output class distribution imbalance, as compared to balanced accuracy score, in the determination of model performance in healthcare and other real-world application systems. Answering this question requires an appraisal of the current state of knowledge on the use of both prediction accuracy score and balanced accuracy score in real-world applications where there is unequal class distribution. A review of related works that highlight the use of imbalanced class distribution datasets with evaluation metrics will assist in contextualizing this systematic review.


Introduction
A key component in disease treatment is estimating the outcome after treatment is initiated. An outcome is driven mainly by two critical issues: patient response and efficient treatment strategies on the part of healthcare givers. Developing effective and efficient strategies [1] for managing severely ill patients remains a major challenge for healthcare providers. Increasing morbidity and mortality are undesirable consequences of insufficient care practices for uncontrolled blood pressure by individuals. This is an important justification for adopting predictive learning techniques capable of identifying important correlated factors associated with the incidence of hypertension. Predictive learning techniques assist in providing real-time solutions to low detection rates among many segments of society. Increasing data generation capacity, together with the availability of tools necessary for data collection, has contributed to the adoption of predictive modeling in healthcare systems. Automated systems such as the Internet of Things (IoT), an emerging paradigm [2,3] involving human interactions and the interconnection of devices, have contributed to the large volumes of data being witnessed today. Characteristically, healthcare systems are associated with the generation of large volumes of data brought on by the use of connected medical devices such as remote patient monitoring and virtual assistant devices for blood pressure, pulse, heart rate, diabetic monitors etc. Other connected devices include connected contact lenses, glucose monitors, wearables, fitness tracking devices, virtual healthcare assistants, virtual dispensing assistants etc. Data generated from these applications have been explored in many research works to identify patterns of change using different predictive machine learning (ML) approaches, including non-clinical ones [4], to enhance disease diagnosis for improved treatment outcome. Assessing predictive modeling performance has become a focus of many research works that
include review studies on feature selection methods and predictive model use in lung cancer radiomics [5]. This study found random forest and support vector machine useful in classification tasks in the review studies investigated. Additionally, a study using environmental parameters to improve deep learning model performance for the prediction of COVID-19 daily cases in 9 cities across three countries in different climatic zones, using a variety of recurrent neural networks (LSTM), concluded that the inclusion of environmental parameters resulted in improved model performance [6]. Diabetes prediction with applied data mining techniques such as random forest, support vector machines, logistic regression and naïve Bayes showed that logistic regression achieved the highest prediction accuracy score of 82.46% as compared to the others [7]. A comparative study on model performance in predictive modeling of cardiac arrest in smokers using heart rate variability parameters showed that random forest achieved the best prediction accuracy score of 93.61%, against 88.50% for logistic regression and 92.59% for the decision tree classifier [8].
Evaluation in general involves three important qualities: being systematic, assessment, and the determination of value, worth and significance. Systematic connotes an interpretation which is structured to give meaning. Different predictive techniques use different or the same evaluation metrics [9]. For example, evaluation metrics for ML techniques in classification analysis may be the same as or differ from those used in regression analysis, depending on the problem under consideration. The challenge here is when to use which metric, for what reason and to what benefit. Identifying the appropriate domain for use and the reason for use, such as evaluating performance for optimization or estimating the number of correctly classified patients for treatment default, the number of patients with certain types of diseases etc, could provide better use of predictive models. In this review, we offer a thorough discussion of various performance evaluation metrics in line with the key research question: the effects of using prediction accuracy score, as compared to balanced accuracy, to determine the appropriate machine learning model for predictive performance in datasets with unequal class distributions (imbalanced datasets), which are predominant in real-world applications.

Related works
It is important that the development and evaluation of ML techniques are made transparent and interpretable to allay any doubt about their usability in healthcare systems. Predictive model evaluation, especially in healthcare and other real-world application systems with class distribution inequality, must take into account the peculiarity of the dataset when assessing predictive model performance [10]. Prediction accuracy score shows results obtained from both observed and predicted values. It is predominantly used in classification problems where there is no dataset class imbalance and no skewed class examples. However, one of the challenges identified in many research works is its use as the main performance metric to estimate the best or most appropriate machine learning technique in real-world applications such as healthcare systems, where dataset class distribution inequality is prevalent. The challenge of using prediction accuracy as a measure of model performance is mentioned in a related review work that examined the prospects of machine learning use in clinical outcomes [11]. Concerns regarding the use of prediction accuracy score are shared in a study of disease diagnosis with 20 machine learning techniques, including Naïve Bayes, Support Vector Machine (SVM), K-Nearest Neighbors (KNN), Perceptron, Light Gradient Boosting Machine and extreme Gradient Boosting, which addressed this challenge with the f1-score evaluation metric [12]. The prediction accuracy scores obtained with the various techniques ranged between 49% and 77%, while the f1-scores obtained ranged between 47% and 82%. A review study of artificial intelligence in disease diagnosis mentioned prediction accuracy as one of the evaluation parameters of interest [13]. Similarly, a comparative study of disease prediction with supervised ML techniques also identified prediction accuracy score as a performance metric [14]. A similar use of prediction accuracy [15] in assessing the best ML technique for breast cancer prediction recorded an accuracy score of
98.7% for techniques such as decision trees and other ensemble techniques. ML principles and applications in real-world systems have also been explored [16]. An automatic prediction system for diabetic patients with several ML techniques for explainable artificial intelligence [17] concluded with a prediction accuracy score of 81% and an auc score of 84%. An additional study to predict pressure ulcer nursing adverse events [18] using four ML techniques (Decision trees, Support Vector Machines, Random Forest and Artificial Neural Networks) achieved prediction accuracy scores of 94.94% for Support Vector Machine, 97.93% for Decision trees, 99.88% for Random Forest and 79.02% for Artificial Neural Networks. Determination of appropriate ML algorithms to identify mental health problems [19] at an early stage, with techniques such as Logistic Regression, Gradient Boosting, Neural Networks, K-Nearest Neighbor, Support Vector Machine and ensemble techniques, showed an overall prediction accuracy score of 88.80% achieved by Gradient Boosting. An additional study to predict heart disease with ML algorithms such as K-Nearest Neighbors (KNN), Naive Bayes and Random Forest singled out Random Forest as the best performing classifier, with a prediction accuracy score of 95.63% [20]. Further studies of ML use in cardiovascular disease prediction with learning techniques such as support vector machine, convolutional neural networks and boosting classifiers produced roc_auc scores in the range 81%-97% [21]. Diagnosis of breast cancer with learning techniques such as linear discriminant analysis (LDA) and support vector machine (SVM) for various roles gave prediction accuracy readings of 99.2% and 79.5% [22]. However, the prediction of breast cancer with Decision tree and Random forest techniques [23] showed prediction accuracy scores of 91.18% and 95.72% respectively. An additional ML application as decision support [24] for the detection of breast cancer through feature selection with the ML techniques K-Nearest
Neighbor, linear discriminant analysis and probabilistic neural network yielded an accuracy score of 99.17%. Furthermore, prediction of breast cancer with an ML-based framework [25] using the ML techniques Random Forest, Gradient Boosting, Support Vector Machine, Artificial Neural Network and Multilayer Perceptron, aiming at better classification accuracy through correlation-based feature selection together with recursive feature elimination, resulted in a prediction accuracy score of 99.12%. Similarly, with a weighting feature and backward elimination feature selection approach [26], application of the Random forest ML technique to create a computer-aided diagnostic system to distinguish breast cancer tumors between malignant and benign yielded prediction accuracy scores of 99.7% and 99.82% respectively. A study aiming at higher precision and prediction accuracy [27] compared four models, namely all features without validation (model 1), all features with K-fold cross-validation (model 2), feature selection (model 3) and feature selection together with cross-validation (model 4), using the ML techniques logistic regression, support vector machines, Naive Bayes, Decision trees and k-nearest neighbor, and produced different prediction accuracy scores at each stage. The highest accuracy scores recorded were 98.83% for support vector machine, 97.17% for K-Nearest Neighbor and 97.88% for Logistic regression. Similarly, an ML-based model for early stage heart disease prediction with the techniques support vector machine, K-nearest neighbor, random forest, Naive Bayes and decision tree, using feature selection techniques (chi-square, ANOVA and mutual information) to determine the best fit model, concluded that Random forest had the highest prediction accuracy score of 94.51% [28].
A related study on the choice of the best ML model for prediction of breast cancer [29] also reported prediction accuracy scores of 98% for Artificial Neural Network, 98% for Decision tree classifier, 99% for K-Nearest Neighbor, 98% for Logistic regression and 100% for Support vector machine. Risk prediction and diagnosis of breast cancer [30] through a comparative analysis of ML techniques, assessing model efficiency and effectiveness with respect to prediction accuracy, precision, sensitivity and specificity, showed that support vector machine had the highest prediction accuracy of 97.13% with the least error rate. A related study [31] to predict and diagnose breast cancer using ML techniques, and to determine the best model with evaluation metrics such as confusion matrix, accuracy and precision, showed that Support Vector Machine, among other ML techniques (Random Forest, Logistic Regression, Decision tree (C4.5) and K-Nearest Neighbors), achieved the greatest prediction accuracy score of 97.2%. The continued use of models such as Support vector machines, Logistic regression, Random forest and Clustering in classification problems such as chronic disease diagnosis is emphasized in a related study that found them to be useful [32]. Similarly, the prediction of treatment trend for patients suffering from hypothyroidism using sodium levothyroxine with ML techniques showed that extra-trees achieved the better prediction accuracy of 84% [33]. Following from this is a predictive study of chronic kidney disease [34] with three ML techniques, namely Random forest, Support Vector machine and Decision tree, together with the recursive feature elimination technique. This study showed different prediction accuracy scores depending on whether feature selection was used. Prediction accuracy scores recorded with feature selection were as follows: 99.8% for Random forest, 95.5% for Support vector machine and 98.6% for Decision
tree. Additional studies on predictive modeling of chronic diseases, such as sclerosis progression over 6 and 10 year periods [35], using ML techniques such as K-nearest neighbor, Support vector machine, Decision tree and Logistic regression, reported the performance evaluation metrics area under the curve (auc), sensitivity, specificity, geometric mean and f1-score for each period. The auc scores for disease severity in the 6th year were KNN 74%, Decision tree 74%, Linear regression 80% and Support vector machine 80%. Disease severity in the 10th year had auc scores of KNN 67%, Decision tree 57%, Linear regression 67% and Support vector machine 73%.
Further studies [36] on the detection of chronic kidney disease, to show important correlations or predictive attributes using ML techniques (k-nearest neighbors, random forest and neural networks) and 24 features, used accuracy, root mean squared error (rmse) and f1-score as evaluation parameters. A prediction accuracy score of 99.3% for the Random forest classifier was achieved. Additional research to identify advanced chronic kidney disease with the ML techniques generalized linear model network, random forest, artificial neural network and natural language processing [37] showed improved prediction performance in the accuracy scores reported. Prediction accuracy scores for the ML techniques on training and testing data respectively were: Logistic regression 81.8% and 81.9%, Random forest 91.3% and 82.1%, Decision tree 86.0% and 82.1%. Its conclusion recommends improvement on the achieved prediction accuracy scores. Application of deep learning techniques for prediction and classification of hypertension with related variables [38] showed the following prediction accuracy scores: Deep neural network (75%, 73.9%, 74.3%, 74.3%) and Decision tree (67.6%, 68.4%, 69%, 68%). A related study [39] on the prediction of hypertension using features such as patient demographics, past and current patient health condition and medical records for the determination of risk factors using an artificial neural network showed a prediction accuracy score of 82%.
Understanding disease symptoms is one sure way of effectively controlling and managing treatment outcome. Predictive modeling [40] of heart disease risks and symptoms using ML techniques will help ensure effective patient care. Implementation of heart disease risk prediction using six ML techniques (support vector machine, Gaussian Naive Bayes, Logistic regression, light gradient boosting model, extreme gradient boosting and Random forest) showed the following prediction accuracy scores: 80.23%, 78.68%, 80.32%, 77.04%, 73.77% and 88.5% respectively.
A population-level approach [41] for predicting hypertension using ML techniques (extreme Gradient Boosting, Gradient Boosting Machine, Logistic Regression, Random forest, Decision tree and Linear Discriminant Analysis) had a prediction accuracy score of 90% for extreme Gradient Boosting, Gradient Boosting Machine, Logistic Regression and Linear Discriminant Analysis, as compared to 89% for Random forest and 83% for Decision tree.
1.1.0 Accuracy score in non-health settings. Related research perspectives in other real-world applications such as spam message detection, fraud detection and risk estimation/forecasting are explored in this section. The risks of spam messaging and its impact on business operations are far-reaching, and include hacked systems and ransom payment demands, destruction of critical data and infrastructure and many others. Applying effective and efficient ML modeling techniques that identify important characteristics for the detection and subsequent prevention or destruction of the threats posed continues to engage research attention. A study to detect spam threats [42] in emails and IoT platforms using Naïve Bayes, decision trees, neural networks and random forest together with other techniques reported prediction accuracy and precision scores as follows: for Support Vector Machine and Naive Bayes, 96.9% with precision 93.12%; and for Naive Bayes, 99.46% with precision 99.66%. Similarly, transformer-based embedding with ensemble learning techniques for spam detection showed a prediction accuracy score of 99.91% [43]. Furthermore, application [44] of a hybrid algorithm for the detection of malicious spam messaging in email with the ML techniques Naive Bayes, Support vector machines, Logistic Regression and Random Forest showed prediction accuracy scores of 96.15% for Naive Bayes, 96.15% for support vector machine, 98.08% for Logistic regression and 95.38% for Random forest. Evaluation of automatic short message service performance [45] using Naive Bayes, BayesNet, C4.5, J48, Self-organizing map and Decision tree showed prediction accuracy scores of 89.64%, 91.11%, 80.24%, 79.2%, 88.24% and 75.76% respectively. A comparative performance evaluation to improve prediction accuracy [46] of two ML models, support vector machine and random forest, for the detection of junk mail spam showed prediction accuracy of 93.52% for Support vector machine and 91.41% for Random forest. Related to improving
prediction accuracy is the issue of improving training time and reducing prediction error rate. An ML-based hybrid bagging technique [47] using random forest and decision tree (J48) for email spam detection showed a 98% prediction accuracy score. Other performance metrics evaluated include true negative rate, false positive rate, false negative rate, precision, recall and f-measure (f1-score). The increase in online transactions, including online payments, has also increased the risk of credit card fraud. An ML-based credit card fraud detection system [48] using a genetic algorithm with the learning techniques Decision Tree, Random Forest, Logistic Regression, Artificial Neural Network and Naive Bayes showed that genetic algorithm feature selection led to a prediction accuracy score of 100% for both Decision tree and Artificial neural network. Related to study [48] is a financial fraud detection system in healthcare using ML techniques such as deep learning to address the challenge of credit card fraud monitoring [49]. Applying the ML techniques Naive Bayes, Logistic Regression, K-Nearest Neighbor, Random Forest and Sequential Convolutional Neural Network resulted in prediction accuracy scores of 96.1%, 94.8%, 95.89%, 97.58% and 92.3% respectively. Strategies have been adapted and adopted by various organizations to deal with the challenge of fraud detection. One such solution is provided by [50], which implemented an ML-based self-analyzing system to flag potentially fraudulent activities for review. A case study approach [51] for a review of ML techniques (logistic regression, decision tree, random forest, K-Nearest Neighbor and extreme Gradient Boosting) in credit card fraud detection evaluated best model prediction performance using accuracy, recall, precision and f1-score metrics. The study identified Logistic regression and K-Nearest Neighbor as the best performing classifiers. Implementation of fraud detection tools [52] to identify anomalies on
financial applications using outlier detection techniques such as Local outlier factor, Isolation forest and Elliptic envelope and ML techniques (Random forest, Adaptive boosting and extreme gradient boosting) showed a prediction accuracy score of 99.95%. Modeling [53] of medical visits by patients suffering from diabetes with the ML techniques logistic regression, support vector machine, linear discriminant analysis, quadratic discriminant analysis, extreme gradient boosting, neural networks and deep neural network obtained a balanced accuracy score of 65.7%. Similarly, predicting length of stay [54] from admission to clinical ward with the ML techniques random forest, decision trees, support vector machine, multi-layer perceptron, adaboost and gradient boost concluded with random forest as the best performing technique, with a balanced accuracy score of 72% at the initial stage of admission and 75% in-admission. However, an up-sampling approach [55] for breast cancer prediction using k-nearest neighbor, decision tree, random forest, neural networks, support vector machine and extreme gradient boosting obtained a balanced accuracy score of 97.47%.
1.1.1 Related works summary. The systematic review of related research works had key objectives, among them the search for literature with the following characteristics: a focus on the current state of knowledge with respect to ML techniques, applications and evaluations; research works with prediction accuracy score as an evaluation metric; and research works in real-world contexts with unequal class distributions using relevant methodologies. No specific search timeline was defined; the motivation for not specifying a search period was to include as many important related works as possible irrespective of date of publication. Of particular interest was work on healthcare systems and other real-world applications (spam detection, fraud prediction, risk prediction etc). A summary of identified characteristics among selected reviewed literature, with emphasis on prediction accuracy score as performance metric, is presented in Table 1. Literature search sources were Google Scholar and other online journal databases such as IEEE, PubMed, Hindawi journals, BioMed Central, PMC, Elsevier, ScienceDirect, organizational websites, online libraries and many other journals. A total of 80 articles were screened for relevancy, and the inclusion criterion was related works in healthcare practice that had used predictive machine learning in disease diagnosis, prediction, risk or treatment assessment. Literature on related works with ML applications in other relevant settings, such as spam detection in mail and SMS spamming, was also considered. No time frame exclusion criterion was used, but about 80% of selected materials were works published between 2016 and 2022, with a handful in 2023. Observations from the related literature indicate extensive use of ML techniques in real-world applications for various reasons, including serving as decision support systems. Predominantly used techniques include Random forest, Support vector
machine, Logistic regression, K-Nearest Neighbor, Decision trees, Gradient boosting classifier and a few ensemble techniques. The use of performance evaluation metrics such as precision, recall, f1-score, prediction accuracy and, in some instances, predicted positive and predicted negative values is observed. Of interest is the use of prediction accuracy as the predominant metric for assessing model performance, found among all the related literature reviewed.
1.1.2 Strengths and weaknesses identified in reviewed literature. In much of the literature reviewed, a pattern of high prediction accuracy scores is observed, as is the use of more than one predictive modeling technique for comparative analysis. The use of predictive modeling in disease detection, diagnosis and treatment outcome for diseases of public concern, together with predictive modeling in e-mail spam prediction, fraud detection, risk prediction etc, is also observed. The desire of many is to address challenges with novel techniques from different perspectives. Differences in the feature selection and optimization tools used to estimate variable importance and to improve prediction performance are also indicated, with varying outcomes. Model performance evaluation is also indicated in almost all literature reviewed. However, with few exceptions, the majority of the reviewed literature shows a strong tendency to estimate best model performance largely on prediction accuracy score, irrespective of the problem domain and dataset class distribution. We also note the high balanced accuracy score recorded by [55], achieved using an up-sampling technique, compared with those reported in [53,54].

Research question
The incidence of dataset class inequality in most real-world applications, including healthcare systems, and how it affects predictive modeling performance has received little attention in current research studies. The minority class contribution, which is overlooked by most learning algorithms in such situations, is rarely addressed by related research works, resulting in skewed model performance estimates influenced mainly by the majority class contribution. As an example, in the prediction of patient treatment default, the number of non-defaulters may exceed the number of defaulters by hundreds of thousands, or by a significant ratio such as 1:100000, but the key challenge is to correctly identify the minority patient defaulters for necessary interventions. Assessing model performance within this context with prediction accuracy score therefore creates a challenge for sound model performance assessment, as the minority class contribution is discounted.

Materials and methods
A systematic review of related research works was conducted through an adopted search strategy protocol for relevant literature, with a focus on characteristics such as the current state of knowledge with respect to ML techniques, applications and evaluations; research works with prediction accuracy score as an evaluation metric; and research works in real-world contexts with appropriate methodologies. No specific search timeline was defined; the motivation for not specifying a search period was to include as many important works as possible irrespective of publication date. Of particular interest were related works on healthcare systems and other real-world applications (spam detection, fraud prediction, risk prediction etc) with dataset class distribution inequality.
Our approach was to adopt the guidelines emphasized in the preferred reporting items for systematic reviews and meta-analyses (PRISMA) protocol. These were: designing the research question, adopting searches and a search strategy, developing inclusion and exclusion criteria, designing a data extraction plan to synthesize and draw conclusions, setting quality assessment criteria and developing strategies to analyze the collected data.
2.0.1 Search strategy. Literature used was obtained from the following sources: PubMed, Google Scholar, Web of Science indexed journals, Scopus indexed journals (Springer Nature, Hindawi, Elsevier, ScienceDirect, IEEE Access, IEEE Xplore) and many others. Search terms included: predictive modeling in healthcare systems, machine learning prediction accuracy score, disease diagnosis with machine learning, machine learning prediction of disease (chronic kidney, hypertension, breast cancer), machine learning model performance evaluations, fraud detection with machine learning, detection of spam messages with machine learning, machine learning prediction with balanced accuracy score, dealing with class imbalance in machine learning etc. Our search period started from 2016 to ensure access to most materials, since ML use in healthcare has been limited since its inception.

Inclusion criteria.
Our inclusion criteria for relevant articles were: model performance evaluation metrics, evaluation with accuracy scores, prediction with ML methods (techniques), ML applications in healthcare, ML use in healthcare (diagnosis, treatments, disease management), fraud detection, spam detection, risk prediction, junk mail prediction, ML in disease treatment default, deep learning applications in healthcare and many others.
2.0.3 Exclusion criteria. Excluded from the search were: ML application articles without performance evaluation, articles considered to be outside the realm of real-world application, articles with duplicate findings, articles with findings inconsistent with stated research objectives and reviewed articles.

Data extraction plan.
To assist in extracting relevant information from the sourced documents, every article downloaded was placed in Mendeley Desktop, including source documents from non-academic websites such as industrial webpages with relevant information.
2.0.5 Quality assessment. Our quality assessment procedure was to follow all protocols stated in the PRISMA guidelines. This resulted in the use of 68.6% of the total articles sourced, which met all inclusion and exclusion criteria.

Evaluation metrics in classification
This section gives a brief description of the performance evaluation metrics used in most machine learning applications for classification, to demonstrate how each metric is used and the reasons for its use.

ROC score curve
Lowering the classification threshold yields more positive predictions, which increases sensitivity and decreases specificity. Conversely, raising the threshold yields more negative predictions, giving high specificity and low sensitivity [57].
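The threshold trade-off described above can be illustrated with a minimal sketch. The labels, scores and the two thresholds below are hypothetical values invented for illustration and are not taken from any of the reviewed studies; the helper name sensitivity_specificity is likewise an assumption.

```python
def sensitivity_specificity(y_true, scores, threshold):
    """Return (sensitivity, specificity) when scores >= threshold are predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(y_true, scores) if y == 1 and s < threshold)
    tn = sum(1 for y, s in zip(y_true, scores) if y == 0 and s < threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= threshold)
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical labels (1 = positive class) and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.45, 0.6, 0.7, 0.9]

low_threshold = sensitivity_specificity(y_true, scores, 0.25)
high_threshold = sensitivity_specificity(y_true, scores, 0.65)

print("threshold 0.25: sensitivity, specificity =", low_threshold)
print("threshold 0.65: sensitivity, specificity =", high_threshold)
```

Repeating this calculation over many thresholds and plotting sensitivity against (1 - specificity) gives the ROC curve.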

Confusion matrix.
The confusion matrix is a classification performance metric consisting of combinations of predicted and actual values. It is the foundation from which precision, recall, roc_auc, specificity and prediction accuracy are derived.
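As a minimal sketch of how the four confusion matrix cells are counted for a binary problem, the example below uses hypothetical labels (1 = positive, 0 = negative). The function name confusion_counts and the data are assumptions made for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # correctly predicted positive
        elif predicted == 1 and actual == 0:
            fp += 1  # predicted positive, actually negative
        elif predicted == 0 and actual == 0:
            tn += 1  # correctly predicted negative
        else:
            fn += 1  # predicted negative, actually positive
    return {"TP": tp, "FP": fp, "TN": tn, "FN": fn}

# Hypothetical actual and predicted labels.
y_true = [1, 0, 0, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
accuracy = (counts["TP"] + counts["TN"]) / len(y_true)

print(counts)
print("prediction accuracy =", accuracy)
```

The other classification metrics in this section can be computed from the same four counts.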
2.1.4 Log-loss. Log-loss measures the closeness of the prediction probability to the corresponding actual or true value (0 or 1). A higher log-loss indicates greater divergence of the prediction probability from the actual value.
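A minimal sketch of the binary log-loss calculation follows, assuming the usual definition as the mean negative log-likelihood of the true labels. The two sets of predicted probabilities are hypothetical, and the clipping constant eps is an assumption added only to avoid taking log(0).

```python
import math

def log_loss(y_true, probabilities, eps=1e-15):
    """Mean binary log-loss; probabilities are predicted P(class = 1)."""
    total = 0.0
    for y, p in zip(y_true, probabilities):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

# Hypothetical true labels and two sets of predicted probabilities.
y_true = [1, 0, 1, 0]
close_predictions = [0.9, 0.1, 0.8, 0.2]    # close to the true labels
distant_predictions = [0.6, 0.5, 0.4, 0.7]  # further from the true labels

loss_close = log_loss(y_true, close_predictions)
loss_distant = log_loss(y_true, distant_predictions)

print("log-loss (close predictions):  ", round(loss_close, 4))
print("log-loss (distant predictions):", round(loss_distant, 4))
```

The predictions that sit further from the true labels produce the larger log-loss, as described above.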
2.1.5 Precision. Precision is the proportion of data points classified by the model as positive that are truly positive. False positive predictions are data points the model identifies as positive but that are truly negative (false alarms). Precision = TP/(TP + FP), where TP = true positives and FP = false positives.
2.1.6 Recall. Recall is a model's ability to identify all relevant class instances in a dataset. Recall = TP/(TP + FN), where FN = false negatives. Precision and recall need to be considered together: for example, labeling all patients as defaulters to disease treatment identifies every defaulter and so leads to a high recall value, but a low precision score.
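The trade-off between precision and recall can be illustrated with a small worked example. The sketch below assumes a hypothetical group of 100 patients containing 10 treatment defaulters; all counts are invented for illustration.

```python
def precision(tp, fp):
    """Precision = TP / (TP + FP); returns 0.0 if nothing was predicted positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Recall = TP / (TP + FN); returns 0.0 if there are no actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

# Hypothetical case 1: every one of the 100 patients is labeled a defaulter.
# All 10 defaulters are found (TP = 10, FN = 0) at the cost of 90 false alarms.
precision_all = precision(tp=10, fp=90)
recall_all = recall(tp=10, fn=0)

# Hypothetical case 2: a more selective model flags 12 patients, 8 of them correctly.
precision_selective = precision(tp=8, fp=4)
recall_selective = recall(tp=8, fn=2)

print("label everyone: precision =", precision_all, "recall =", recall_all)
print("selective:      precision =", round(precision_selective, 3),
      "recall =", recall_selective)
```

Labeling everyone as a defaulter maximizes recall but gives a precision equal to the share of defaulters in the group, which is why neither metric is sufficient on its own.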
2.1.7 F1 score. F1 score is the harmonic mean of precision and recall, combining the two into a single value: F1 = 2 × (Precision × Recall)/(Precision + Recall). It weights precision and recall equally and is used extensively in search engines for relevant information retrieval.
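A minimal sketch of the harmonic mean calculation is given below. The precision and recall values are hypothetical (a model with recall 1.0 but precision 0.1), chosen to show how the harmonic mean differs from a simple arithmetic mean.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Hypothetical values: perfect recall but very low precision.
p, r = 0.1, 1.0

f1 = f1_score(p, r)
arithmetic_mean = (p + r) / 2

print("F1 score        =", round(f1, 4))
print("arithmetic mean =", arithmetic_mean)
```

Because the harmonic mean is pulled toward the smaller of the two values, a model cannot obtain a high F1 score from high recall alone.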

Evaluation metrics in regression
Some of the evaluation metrics used in regression analysis are as follows:
• Mean Squared Error (MSE)
• Mean Absolute Error (MAE)
• Mean Absolute Percentage Error (MAPE)
• Root Mean Squared Error (RMSE)
• Root Mean Error (RME)
• Adjusted R-Squared (Adjusted R²)
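Several of the regression metrics listed above can be computed directly from observed and predicted values, as in the following sketch. It uses the standard textbook definitions of MSE, MAE, MAPE, RMSE and adjusted R²; the observed and predicted values are hypothetical, and the function name regression_metrics is an assumption. Root Mean Error is omitted because the text does not define it.

```python
import math

def regression_metrics(y_true, y_pred, n_features):
    """Compute common regression metrics; assumes no true value is zero (for MAPE)."""
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mape = 100 * sum(abs(e / t) for e, t in zip(errors, y_true)) / n
    rmse = math.sqrt(mse)
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    # Adjusted R-squared penalizes the number of predictors used by the model.
    adjusted_r2 = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return {"MSE": mse, "MAE": mae, "MAPE": mape, "RMSE": rmse,
            "R2": r2, "Adjusted R2": adjusted_r2}

# Hypothetical observed and predicted values from a model with one predictor.
y_true = [100.0, 120.0, 140.0, 160.0, 180.0]
y_pred = [110.0, 115.0, 150.0, 155.0, 170.0]

metrics = regression_metrics(y_true, y_pred, n_features=1)
for name, value in metrics.items():
    print(name, "=", round(value, 4))
```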

Balanced accuracy.
Balanced accuracy is a metric used to evaluate performance on imbalanced datasets. It is the average of sensitivity and specificity.
\[ \text{Balanced accuracy} = \frac{TPR + TNR}{2} \]
Where TPR = true positive rate, TNR = true negative rate
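The difference between prediction accuracy and balanced accuracy on an imbalanced dataset can be sketched with a simple case. The confusion matrix counts below are hypothetical: a cohort of 95 negative and 5 positive cases, and a model that predicts every case as negative.

```python
def accuracy_and_balanced_accuracy(tp, tn, fp, fn):
    """Return (accuracy, balanced accuracy) from the four confusion matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    tpr = tp / (tp + fn)  # true positive rate (sensitivity)
    tnr = tn / (tn + fp)  # true negative rate (specificity)
    return accuracy, (tpr + tnr) / 2


# Hypothetical majority-class predictor: all 5 positives are missed.
accuracy, balanced = accuracy_and_balanced_accuracy(tp=0, tn=95, fp=0, fn=5)
print(f"accuracy={accuracy:.2f} balanced accuracy={balanced:.2f}")
```

Accuracy reports 0.95 although no positive case is detected, whereas balanced accuracy reports 0.50, the value expected from a model with no discriminating ability.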

Results
Performance evaluation metrics used in model assessments, such as receiver operating characteristic curves, the confusion matrix that describes misclassifications, and average prediction accuracy score determination (balanced accuracy), are presented in

Balanced accuracy score estimation
Balanced accuracy score determination is presented as a flowchart in Fig 7. The flowchart details the other evaluation quantities that address minority class contribution: true positives, true negatives, false positives, false negatives, true negative rates, true positive rates, false positive rates and false negative rates.

Evaluation model
A flowchart of the evaluation model that addresses minority class contribution is also presented. It shows how the balanced accuracy score is obtained, together with the other metrics that constitute false alarms.

Discussion
In this study, review of related literature on the use of predictive modeling in real-world applications with dataset class distribution inequality such as healthcare for either prediction of a

Contribution
This paper highlights an important ingredient in the choice of the best machine learning model for prediction and places this choice in context. We also assert that the higher prediction accuracy scores reported in some research findings on datasets with class imbalance, when compared with the balanced accuracy scores of studies using similar ML techniques in the same context, create an erroneous impression of high-performing models among individual ML techniques. For this reason, choosing the best performing ML model on the basis of prediction accuracy is problematic if the context and purpose of the predictive modeling are not considered. We have examined only one evaluation metric (prediction accuracy score), but many others remain; we therefore encourage further discussion on the appropriate use of all other evaluation metrics.

Conclusion
In light of the challenges identified with the use of prediction accuracy as a performance measure for best model determination with imbalanced datasets, we propose a novel evaluation approach that takes dataset class imbalance into account for predictive modeling in the healthcare systems context, called the Proposed Model Evaluation Approach (PMEA). PMEA addresses the challenge of using prediction accuracy as an evaluation performance metric by using the balanced accuracy score, derived from the two most important evaluation metrics (true positive rate and true negative rate: TPR, TNR), to estimate model performance in datasets with unequal class distribution; the approach can be generalized to similar contexts. Applying this model to practical business applications could generate more insight into appropriate model choice for enhanced performance. Identifying appropriate evaluation metric(s) for performance assessment with imbalanced dataset class distribution will ensure a true determination of the best performing prediction model for recommendation in context. We have examined the literature and identified individual approaches to solving these issues, including their context. We have proposed an approach to deal with an identified challenge in context. We believe this is not exhaustive; other evaluation assessments and their applicability in context will be examined in future research studies.

2.1.1 Prediction accuracy. In ML, prediction accuracy defines how well a model performs at predictions on unseen data. Prediction accuracy is simply the fraction of model predictions that are correct [56]. Prediction accuracy is illustrated as
\[ \text{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} \]
In classification, accuracy is calculated in terms of positive and negative predictions:
\[ \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \]
Where TP = true positives, TN = true negatives, FP = false positives, FN = false negatives
2.1.2 Receiver operating characteristic curve (roc_auc). roc_auc measures an ML model's ability to differentiate between classes. A roc_auc score closer to 1 indicates favorable model performance at predicting 0 as 0 and 1 as 1. Some of the terms used with the roc_auc curve are the true positive rate (TPR, also called recall or sensitivity), specificity and the false positive rate (FPR):
\[ \text{TPR} = \text{Recall} = \text{Sensitivity} = \frac{TP}{TP + FN} \]
\[ \text{Specificity} = \frac{TN}{TN + FP} \]
\[ \text{FPR} = \frac{FP}{TN + FP} \]
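One common reading of the roc_auc score is the probability that a randomly chosen positive instance receives a higher predicted score than a randomly chosen negative instance. The sketch below computes the score through this pairwise interpretation rather than by integrating the curve. The data are hypothetical and the function name roc_auc_pairwise is an assumption for illustration.

```python
def roc_auc_pairwise(y_true, scores):
    """Area under the ROC curve via positive/negative pair comparisons (ties count half)."""
    positives = [s for y, s in zip(y_true, scores) if y == 1]
    negatives = [s for y, s in zip(y_true, scores) if y == 0]
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0
            elif pos_score == neg_score:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.35, 0.5, 0.7, 0.9]

auc = roc_auc_pairwise(y_true, scores)
print(f"roc_auc={auc:.4f}")
```

A score of 1 would mean every positive instance is ranked above every negative one, and a score of 0.5 corresponds to random ranking. The pairwise loop is quadratic in the number of instances, so it is suited to small illustrations rather than large datasets.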

Fig 3. Accuracy prediction. Balanced accuracy diagram. Determining balanced accuracy involves the determination of other important performance indices such as true positives, true negatives, true positive rates, true negative rates, false positives, false negatives, false positive rates and false negative rates, which are necessary to assess model performance regarding class distribution inequality. https://doi.org/10.1371/journal.pdig.0000290.g003

Fig 5. Relevant and irrelevant distribution share. Proportion of relevant and irrelevant distribution share. Collected data distribution based on relevancy. Distribution share of relevant and irrelevant related research works is shown in this figure. https://doi.org/10.1371/journal.pdig.0000290.g005

Table 1. Reviewed literature descriptions.
This research therefore explores other evaluation metrics that take dataset class inequality into account to estimate a reasonable prediction accuracy score for determining the best or most appropriate predictive technique. We therefore propose a novel evaluation approach for predictive modeling in the healthcare systems context, called the Proposed Model Evaluation Approach (PMEA), which addresses minority class contribution challenges in predictive modeling. It is derived by combining the two most important evaluation metrics (true positive rate and true negative rate: TPR, TNR) to estimate the best or most appropriate model performance in context more accurately.